19 research outputs found

    Social media mining for identification and exploration of health-related information from pregnant women

    Get PDF
    Widespread use of social media has led to the generation of substantial amounts of information about individuals, including health-related information. Social media provides the opportunity to study health-related information about selected population groups who may be of interest for a particular study. In this paper, we explore the possibility of utilizing social media to perform targeted data collection and analysis from a particular population group -- pregnant women. We hypothesize that we can use social media to identify cohorts of pregnant women and follow them over time to analyze crucial health-related information. To identify potentially pregnant women, we employ simple rule-based searches that attempt to detect pregnancy announcements with moderate precision. To further filter out false positives and noise, we employ a supervised classifier using a small number of hand-annotated data. We then collect their posts over time to create longitudinal health timelines and attempt to divide the timelines into different pregnancy trimesters. Finally, we assess the usefulness of the timelines by performing a preliminary analysis to estimate drug intake patterns of our cohort at different trimesters. Our rule-based cohort identification technique collected 53,820 users over thirty months from Twitter. Our pregnancy announcement classification technique achieved an F-measure of 0.81 for the pregnancy class, resulting in 34,895 user timelines. Analysis of the timelines revealed that pertinent health-related information, such as drug-intake and adverse reactions can be mined from the data. Our approach to using user timelines in this fashion has produced very encouraging results and can be employed for other important tasks where cohorts, for which health-related information may not be available from other sources, are required to be followed over time to derive population-based estimates.Comment: 9 page

    A Chronological and Regional Analysis of Personal Reports of COVID-19 on Twitter from the UK

    Get PDF
    OBJECTIVE: Given the uncertainty about the trends and extent of the rapidly evolving COVID-19 outbreak, and the lack of extensive testing in the United Kingdom, our understanding of COVID-19 transmission is limited. We proposed to use Twitter to identify personal reports of COVID-19 to assess whether this data can help inform as a source of data to help us understand and model the transmission and trajectory of COVID-19. METHODS: We used natural language processing and machine learning framework. We collected tweets (excluding retweets) from the Twitter Streaming API that indicate that the user or a member of the user's household had been exposed to COVID-19. The tweets were required to be geo-tagged or have profile location metadata in the UK. RESULTS: We identified a high level of agreement between personal reports from Twitter and lab-confirmed cases by geographical region in the UK. Temporal analysis indicated that personal reports from Twitter appear up to 2 weeks before UK government lab-confirmed cases are recorded. CONCLUSIONS: Analysis of tweets may indicate trends in COVID-19 in the UK and provide signals of geographical locations where resources may need to be targeted or where regional policies may need to be put in place to further limit the spread of COVID-19. It may also help inform policy makers of the restrictions in lockdown that are most effective or ineffective

    ZooPhy: A bioinformatics pipeline for virus phylogeography and surveillance

    Get PDF
    ObjectiveWe will describe the ZooPhy system for virus phylogeography and public health surveillance [1]. ZooPhy is designed for public health personnel that do not have expertise in bioinformatics or phylogeography. We will show its functionality by performing case studies of different viruses of public health concern including influenza and rabies virus. We will also provide its URL for user feedback by ISDS delegates.IntroductionSequence-informed surveillance is now recognized as an important extension to the monitoring of rapidly evolving pathogens [2]. This includes phylogeography, a field that studies the geographical lineages of species including viruses [3] by using sequence data (and relevant metadata such as sampling location). This work relies on bioinformatics knowledge. For example, the user first needs to find a relevant sequence database, navigate through it, and use proper search parameters to obtain the desired data. They also must ensure that there is sufficient metadata such as collection date and sampling location. They then need to align the sequences and integrate everything into specific software for phylogeography. For example, BEAST [4] is a popular tool for discrete phylogeography. For proper use, the software requires knowledge of phylogenetics and utilization of BEAUti, its XML processing software. The user then needs to use other software, like TreeAnnotator [4], to produce a single (“representative”) maximum clade credibility (MCC) tree. Even then, the evolutionary spread of the virus can be difficult to interpret via a simple tree viewer. There is software (such as SpreaD3 [5]) for visualizing a tree within a geographic context, yet for novice users, it might not be easy to use. Currently, there are only a few systems designed to automate these types of tasks for virus surveillance and phylogeography.MethodsWe have developed ZooPhy, a pipeline for sequence-informed surveillance and phylogeography [1]. It is designed for health agency personnel that do not have expertise in bioinformatics or phylogeography. We created a large database of all virus sequences and metadata from GenBank [6] as well as a smaller database for selected viruses perceived to be of great interest for health agencies including: influenza (A, B, and C), Ebola, rabies, West Nile virus, and Zika virus.In Figure 1A, we show our front-end architecture, created in the style of the influenza research database [7], that enables the user to search by: virus, gene name, host, time-frame, and geography. We also allow users to upload their own list of GenBank accessions or unpublished sequences. Hitting “Search” produces a Results tab which includes the metadata of the sequences. We provide a feature to randomly down-sample by a specified percentage or number. We also allow the user to download the metadata in CSV format or the unaligned sequences in FASTA format.The final tab, "Run", includes a text box for specifying an email in order to send job updates and final results on virus spread. We also enable for the user to study the influence of predictors on virus spread (via a generalized linear model). Currently, we have predictors such as temperature, great circle distance, population, and sample size for selected countries. We also offer experts the ability to specify advanced modeling parameters including the molecular clock type (strict vs. relaxed), coalescent tree prior, and chain length and sampling frequency for the Markov-chain Monte Carlo. When the user selects “Start ZooPhy”, a pre-processor eliminates incomplete or non-disjoint record locations and sends the rest for analysis.ResultsWhen initiated, the ZooPhy pipeline includes sequence alignment via Mafft [8] and creation of an XML template via BEASTGen for input into BEAST for discrete phylogeography. It then uses TreeAnnotator [3] to create an MCC tree from the posterior distribution of sampled trees. ZooPhy uses the MCC as input into SpreaD3 for a recreation of the time-estimated migration via a map. If the user selects the GLM option, the system runs an R script to calculate the Bayes factor of the inclusion probability for each predictor and draws a plot including the regression coefficient and its 95% Bayesian credible interval. We are currently working on new visualization techniques such as those demonstrated by Dudas et al. that combine time-oriented spread via a map and evolution on a phylogenetic tree annotated by discrete locations [9].ConclusionsRecent advances in phylodynamics, bioinformatics, and visualization have demonstrated the potential of pipelines to support surveillance. One example is NextStrain which can perform real-time virus phylodynamics [10]. The system has recently been added as an app to the Global Initiative on Sharing Avian Influenza Data (GISAID) database for influenza tracking using DNA sequences [11]. This presentation will highlight a pipeline for virus phylogeography designed for epidemiologists who are not experts in bioinformatics but wish to leverage virus sequence data as part of routine surveillance. We will describe the development and implementation of our system, ZooPhy, and use real-world case studies to demonstrate its functionality. We invite ISDS delegates to use the system via our web portal, https://zodo.asu.edu/zoophy/ and provide feedback on system utilization.References1. Scotch, M., et al., At the intersection of public-health informatics and bioinformatics: using advanced Web technologies for phylogeography. Epidemiology, 2010. 21(6), 764-768.2. Gardy, J.L. and N.J. Loman, Towards a genomics-informed, real-time, global pathogen surveillance system. Nat Rev Genet, 2018. 19: p. 9-20.3. Avise, J.C., Phylogeography : the history and formation of species. 2000, Cambridge, Mass.: Harvard University Press.4. Suchard, M.A., et al., Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol, 2018. 4.5. Bielejec, F., et al., SpreaD3: Interactive Visualization of Spatiotemporal History and Trait Evolutionary Processes. Mol Biol Evol, 2016. 33(8): p. 2167-9.6. Benson, D. A.,et al., GenBank. Nucleic Acids Res, 2018. 46, p. D41-D47.7. Zhang, Y., et al., Influenza Research Database: An integrated bioinformatics resource for influenza virus research. Nucleic Acids Res, 2017. 45: p. D466-D474.8. Katoh, K. and D.M. Standley, MAFFT: iterative refinement and additional methods. Methods Mol Biol, 2014. 1079: p. 131-46.9. Dudas, G., et al., Virus genomes reveal factors that spread and sustained the Ebola epidemic. Nature, 2017. 544(7650): p. 309-315.10. Hadfield, J., et al., Nextstrain: real-time tracking of pathogen evolution. Bioinformatics, 2018.11. NextFlu. 2018; Available from: https://www.gisaid.org/epiflu-applications/nextflu-app/.

    pdm09 MCCs

    No full text
    Maximum Clade Credibility (MCC) Trees for each model/sample of pdm09 in Mexico and USA

    H5N1 MCCs

    No full text
    Maximum Clade Credibility (MCC) Trees for each model/sample of H5N1 in Egypt

    Empirical Trees for pdm09 MCMCs

    No full text
    Starting empirical trees used for pdm09 MCMCs. Use this for the pdm09 BEAST XMLs
    corecore